library(tidyverse)
library(knitr)
library(sjPlot)
library(gridExtra)
library(lme4)
library(emmeans)
library(car) # for vif
library(bbmle) # for AICtab
library(broom) # for glance
theme_set(ggthemes::theme_few())
Mixed modeling with all relevant variables predicting accuracy
From the preregistration, the mixed model was specified thusly:
correct ~ delay * age +
task_experience + cup_distance + board_size + trial +
(1 + delay + trial | site/subject/block/hiding_location ) +
(1 + task_experience + cup_distance + board_size + trial + delay | species)
In the dataframe, subject_site = subject, and norm_age should be used for age.
Model as pre-registered has too many random effects
Error: number of observations (=6246) < number of random effects (=10608) for term (1 + delay + trial | hiding_location:(block:(subject_site:site))); the random-effects parameters are probably unidentifiable
Pruning random effects in the following order (from preregistration):
- Remove correlations between random effects
- Remove random slopes (in the following order)
specieshiding_locationblocksubject
Model only converges once we take out hiding_location. After doing so, the other random effects (correlation, site, species) can be put back in.
The model below converges. Model output is saved in 06_mp_model_v2.rds
correct ~ delay * norm_age +
task_experience + cup_distance + board_size + trial +
(1 + delay + trial | site/subject_site) +
(1 + task_experience + cup_distance + board_size + trial + delay | species)
After pruning random effects with little variability and removing board_size, which covaried with cup_distance, the reduced model has the following structure. It is saved in 06_mp_3_model3_v2.rds
correct ~ delay * norm_age +
task_experience + cup_distance + board_size + trial +
(1 + delay | site/subject_site) +
(1 + delay | species)
Data import
mp_data <- read.csv("../data/merged_data/01_manyprimates_pilot_merged_data_v2.csv")
Prepare code for pre-registered mixed modeling
cup_distance, board_size and trialmodel.data <- mp_data %>%
filter(species != "black_faced_spider_monkey") %>%
mutate_at(vars(cup_distance, board_size, trial), funs(scale(.)[, 1])) %>%
mutate(hiding_location = factor(hiding_location),
delay = fct_relevel(delay, "short"))
The model takes a while to run. Run next line to load model output from previous run with structure below.
mm.1 <- readRDS("06_mp_model1.rds")
mm.1 <- glmer(correct ~ delay * norm_age +
task_experience + cup_distance + board_size + trial +
(1 + delay + trial | site/subject_site/block) +
(1 + task_experience + cup_distance + board_size + trial + delay | species)
, data = model.data
, family = binomial
, control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5))
)
saveRDS(mm.1, "06_mp_model1.rds")
Some diagnostics
theta <- getME(mm.1, "theta")
diag.element <- getME(mm.1, "lower") == 0
any(theta[diag.element] < 1e-5)
[1] TRUE
Confirm model structure
# mm.1@call
formula(mm.1)
correct ~ delay * norm_age + task_experience + cup_distance +
board_size + trial + (1 + delay + trial | site/subject_site) +
(1 + task_experience + cup_distance + board_size + trial +
delay | species)
glance(mm.1) %>% kable(digits = 2)
| sigma | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|
| 1 | -3388.28 | 6892.57 | 7283.47 | 6365.11 | 6188 |
fmt <- function(num, digits) return(round(num, digits))
VarCorr(mm.1) %>% print(formatter = fmt, digits = 3) # comp = c("Variance", "Std.Dev.")
Groups Name Std.Dev. Corr
subject_site:site (Intercept) 0.862
delaylong 0.646 -0.91
delaymedium 0.528 -0.85 0.99
trial 0.093 -0.31 0.61 0.71
site (Intercept) 0.917
delaylong 0.512 -1.00
delaymedium 0.627 -1.00 1.00
trial 0.084 1.00 -1.00 -1.00
species (Intercept) 0.835
task_experienceyes 0.173 -1.00
cup_distance 0.012 1.00 -1.00
board_size 0.232 -1.00 1.00 -1.00
trial 0.021 -1.00 1.00 -1.00 1.00
delaylong 0.494 -1.00 1.00 -1.00 1.00 1.00
delaymedium 0.436 -1.00 1.00 -1.00 1.00 1.00 1.00
CIs
mm.1.ci <- confint(mm.1, method = "Wald") %>% # bootstrap these later
as.data.frame %>%
rownames_to_column %>%
filter(complete.cases(.)) %>%
rename(LL = `2.5 %`, UL = `97.5 %`) %>%
mutate(OR_LL = exp(LL), OR_UL = exp(UL))
coef(summary(mm.1)) %>%
as.data.frame %>%
rownames_to_column() %>%
mutate(OR = exp(Estimate)) %>%
left_join(mm.1.ci, by = "rowname") %>%
select(rowname, OR, OR_LL, OR_UL, Estimate, LL, UL, everything()) %>%
kable(digits = 3)
| rowname | OR | OR_LL | OR_UL | Estimate | LL | UL | Std. Error | z value | Pr(>|z|) |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 4.135 | 1.703 | 10.040 | 1.420 | 0.532 | 2.307 | 0.453 | 3.136 | 0.002 |
| delaylong | 0.284 | 0.169 | 0.477 | -1.257 | -1.775 | -0.740 | 0.264 | -4.763 | 0.000 |
| delaymedium | 0.369 | 0.213 | 0.638 | -0.997 | -1.545 | -0.450 | 0.279 | -3.570 | 0.000 |
| norm_age | 0.987 | 0.796 | 1.224 | -0.013 | -0.228 | 0.202 | 0.110 | -0.117 | 0.907 |
| task_experienceyes | 1.234 | 0.784 | 1.940 | 0.210 | -0.243 | 0.663 | 0.231 | 0.908 | 0.364 |
| cup_distance | 1.554 | 1.253 | 1.927 | 0.441 | 0.226 | 0.656 | 0.110 | 4.014 | 0.000 |
| board_size | 1.779 | 1.263 | 2.506 | 0.576 | 0.233 | 0.919 | 0.175 | 3.296 | 0.001 |
| trial | 1.046 | 0.953 | 1.148 | 0.045 | -0.048 | 0.138 | 0.047 | 0.948 | 0.343 |
| delaylong:norm_age | 1.049 | 0.854 | 1.289 | 0.048 | -0.158 | 0.253 | 0.105 | 0.454 | 0.650 |
| delaymedium:norm_age | 1.082 | 0.887 | 1.320 | 0.079 | -0.120 | 0.278 | 0.102 | 0.775 | 0.438 |
corr <- cov2cor(vcov(mm.1)) %>% as.matrix %>% round(2)
corr[upper.tri(corr, diag = T)] <- ""
colnames(corr) <- 1:10
rownames(corr) <- str_c(1:10, " ", rownames(corr))
corr %>% as.data.frame %>% select(-10) %>% rownames_to_column
based on estimated marginal means
Note. This wasn’t in the preregistration.
emmeans(mm.1, pairwise ~ delay, type = "response")$contrasts
contrast odds.ratio SE df z.ratio p.value
short / long 3.5160746 0.92828519 Inf 4.762 <.0001
short / medium 2.7103806 0.75716209 Inf 3.569 0.0010
long / medium 0.7708541 0.07335892 Inf -2.735 0.0172
Results are averaged over the levels of: task_experience
P value adjustment: tukey method for comparing a family of 3 estimates
Tests are performed on the log odds ratio scale
plot_model(mm.1, title = "Fixed Effects", order.terms = c(7, 4, 3:1, 9:8, 5, 6),
width = .3, show.values = T, value.size = 2.5, value.offset = .3) +
geom_hline(yintercept = 1, lty = 2) +
ylim(0, 3)
ranef.plots <- plot_model(mm.1, type = "re", sort.est = "(Intercept)")
ranef.plots[[1]]
ranef.plots[[2]]
ranef.plots[[3]]
trial random slopes within species as the estimates in the previous models were essentially 0trial from the random slopes for subject/site for the same reasoncorrect ~ delay * norm_age +
task_experience + cup_distance + board_size + trial +
(1 + delay | site/subject_site ) +
(1 + task_experience + cup_distance + board_size + delay | species)
col.mm1 <- glm(correct ~ delay + norm_age +
task_experience + cup_distance + board_size + trial
, data = model.data
, family = binomial)
vif(col.mm1)
GVIF Df GVIF^(1/(2*Df))
delay 1.007876 2 1.001963
norm_age 1.072757 1 1.035740
task_experience 1.056383 1 1.027805
cup_distance 1.323990 1 1.150648
board_size 1.296698 1 1.138726
trial 1.001067 1 1.000533
No signs of high colinearity.
Check how many different levels there are within each random effect
source("diagnostic_fcns.r")
overview <- fe.re.tab("correct ~ delay + task_experience + board_size + cup_distance + trial", "species", data = model.data)
overview$summary
$`delay_within_species (factor)`
3
11
$`task_experience_within_species (factor)`
1 2
9 2
$`board_size_within_species (covariate)`
1 2 4
5 5 1
$`cup_distance_within_species (covariate)`
1 2 4
6 4 1
$`trial_within_species (covariate)`
36
11
This suggests that, within species, random slopes for task_experience does not make much sense as most species have only 1 level. Same is true for cup_distance and board_size. Indeed, the model summary and random effects plot for species confirm that there is little variability in these estimates (they’re close to zero). Therefore they are removed.
correct ~ delay * norm_age +
task_experience + cup_distance + board_size + trial +
(1 + delay + trial | site/subject_site ) +
(1 + delay | species)
mm.2 <- readRDS("06_mp_model2.rds")
mm.2 <- glmer(correct ~ delay * norm_age +
task_experience + cup_distance + board_size + trial +
(1 + delay | site/subject_site) +
(1 + delay | species)
, data = model.data
, family = binomial
, control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5))
)
saveRDS(mm.2, "06_mp_model2.rds")
Confirm model structure
formula(mm.2)
correct ~ delay * norm_age + task_experience + cup_distance +
board_size + trial + (1 + delay | site/subject_site) + (1 +
delay | species)
glance(mm.2) %>% kable(digits = 2)
| sigma | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|
| 1 | -3391.26 | 6838.52 | 7027.23 | 6389.41 | 6218 |
drop1(mm.2, test = 'Chisq')
Single term deletions
Model:
correct ~ delay * norm_age + task_experience + cup_distance +
board_size + trial + (1 + delay | site/subject_site) + (1 +
delay | species)
Df AIC LRT Pr(Chi)
<none> 6838.5
task_experience 1 6836.5 0.0074 0.9314327
cup_distance 1 6848.6 12.0934 0.0005060 ***
board_size 1 6847.9 11.3967 0.0007357 ***
trial 1 6837.1 0.5912 0.4419742
delay:norm_age 2 6835.3 0.8225 0.6628194
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
VarCorr(mm.2) %>% print(comp = c("Variance", "Std.Dev."), formatter = fmt, digits = 3)
Groups Name Variance Std.Dev. Corr
subject_site:site (Intercept) 0.734 0.857
delaylong 0.416 0.645 -0.90
delaymedium 0.261 0.511 -0.85 1.00
site (Intercept) 1.087 1.043
delaylong 0.367 0.606 -1.00
delaymedium 0.456 0.675 -1.00 1.00
species (Intercept) 0.481 0.694
delaylong 0.214 0.462 -1.00
delaymedium 0.157 0.396 -1.00 1.00
CIs
mm.2.ci <- readRDS("06_mp_model2_ci.rds")
# Bootstrap function by Roger Mundry @ MPI EVA
source("boot_glmm.r")
mm.2.ci <- boot.glmm.pred(model.res = mm.2, excl.warnings = F, nboots = 1000,
para = F, resol = 100, level = 0.95, use = NULL,
circ.var.name = NULL, circ.var = NULL, use.u = F,
n.cores = c("all-1", "all"), save.path = NULL)
saveRDS(mm.2.ci, "06_mp_model2_ci.rds")
coef(summary(mm.2)) %>%
as.data.frame %>%
rownames_to_column() %>%
mutate(OR = exp(Estimate)) %>%
cbind(mm.2.ci$ci.estimates[, 2:3]) %>%
select(rowname, OR, Estimate, X2.5., X97.5., everything()) %>%
kable(digits = 3)
| rowname | OR | Estimate | X2.5. | X97.5. | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | (Intercept) | 5.805 | 1.759 | 0.978 | 2.541 | 0.448 | 3.925 | 0.000 |
| delaylong | delaylong | 0.240 | -1.426 | -1.943 | -0.939 | 0.280 | -5.092 | 0.000 |
| delaymedium | delaymedium | 0.313 | -1.161 | -1.663 | -0.665 | 0.281 | -4.130 | 0.000 |
| norm_age | norm_age | 0.992 | -0.008 | -0.212 | 0.211 | 0.109 | -0.078 | 0.938 |
| task_experienceyes | task_experienceyes | 1.016 | 0.015 | -0.282 | 0.350 | 0.173 | 0.089 | 0.929 |
| cup_distance | cup_distance | 1.462 | 0.380 | 0.169 | 0.613 | 0.103 | 3.677 | 0.000 |
| board_size | board_size | 1.472 | 0.386 | 0.198 | 0.593 | 0.094 | 4.097 | 0.000 |
| trial | trial | 1.024 | 0.023 | -0.037 | 0.086 | 0.030 | 0.774 | 0.439 |
| delaylong:norm_age | delaylong:norm_age | 1.052 | 0.051 | -0.162 | 0.249 | 0.106 | 0.479 | 0.632 |
| delaymedium:norm_age | delaymedium:norm_age | 1.093 | 0.089 | -0.117 | 0.286 | 0.101 | 0.883 | 0.377 |
based on estimated marginal means
emmeans(mm.2, pairwise ~ delay)$contrasts
contrast estimate SE df z.ratio p.value
short - long 1.4257372 0.2800475 Inf 5.091 <.0001
short - medium 1.1604769 0.2810297 Inf 4.129 0.0001
long - medium -0.2652602 0.0865917 Inf -3.063 0.0062
Results are averaged over the levels of: task_experience
Results are given on the log odds ratio (not the response) scale.
P value adjustment: tukey method for comparing a family of 3 estimates
code to bootstrap these CIs (confint(emmeans(...) gets Wald CIs)
boot.pairs <- function(m) suppressMessages(summary(pairs(emmeans(m, "delay")))$estimate)
boot.emm <- function(m, n = 1000) {
boo <- bootMer(m, FUN = boot.pairs, nsim = n, parallel = "multicore", ncpus = 4)
ci <- sapply(1:3, function(j) quantile(boo$t[,j], prob = c(.025, .975), na.rm = T))
ci <- t(rbind(boo$t0, ci))
rownames(ci) <- c("short vs. long", "short vs. medium", "medium vs. long")
return(ci)
}
This took ~13 hours to run on 4 cores on my machine (jw). Skip this chunk and run the next one to load CIs from previous run.
system.time(mm2.emm.ci <- boot.emm(mm.2, n = 1000))
saveRDS(mm2.emm.ci, "06_mp_model2_emm_ci.rds")
user system elapsed
130648.652 1127.768 47989.548
mm2.emm.ci <- readRDS("06_mp_model2_emm_ci.rds")
round(mm2.emm.ci, 2)
2.5% 97.5%
short vs. long 1.43 0.95 2.01
short vs. medium 1.16 0.67 1.73
medium vs. long -0.27 -0.44 -0.10
odds ratios
round(exp(mm2.emm.ci), 2)
2.5% 97.5%
short vs. long 4.16 2.58 7.46
short vs. medium 3.19 1.94 5.65
medium vs. long 0.77 0.64 0.90
fe.2 <- plot_model(mm.2, title = "A. Fixed Effects",
axis.labels = c("Trial", "Task Experience (yes)","Board Size (cm)",
"Cup Distance (cm)", "Normed Age x Delay\n(short vs. long)",
"Normed Age x Delay\n(short vs. medium)", "Normed Age",
"Delay (short vs. long)", "Delay (short vs. medium)"),
order.terms = c(2:1, 3, 9:8, 5, 6, 4, 7), wrap.labels = F,
width = .3, show.values = T, value.size = 2.5, value.offset = .3) +
facet_wrap(~ "") +
# geom_hline(yintercept = 1, lty = 2) +
scale_y_continuous(trans = "log", limits = c(.1, 5.5), breaks = c(.1, .2, .5, 1, 2, 5))
fe.2 + theme(plot.margin = unit(c(.5, 7, .5, .5), "cm"))
ggsave("../graphs/05_forestplot.png", fe.2, width = 3, height = 2.5, scale = 2)
ranef.plots2 <- plot_model(mm.2, type = "re", sort.est = "(Intercept)")
ranef.plots2[[1]]
ranef.plots2[[2]]
ranef.plots2[[3]]
mm.3 <- glmer(correct ~ delay * norm_age +
task_experience + cup_distance + trial +
(1 + delay | species)
, data = model.data
, family = binomial
, control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5))
)
phylo <- read_csv("../data/species_data.csv") %>% select(species, species_formatted, phylo)
plot.data <- get_model_data(mm.3, type = "re", show.values = T) %>%
left_join(phylo, by = c("term" = "species")) %>%
rename(species = species_formatted) %>%
mutate(
facet = fct_rev(facet),
facet = fct_recode(facet, "Intercept" = "species (Intercept)",
"Delay (short vs. long)" = "delaylong",
"Delay (short vs. medium)" = "delaymedium")
)
Column `term`/`species` joining factor and character vector, coercing into character vector
sorted <- filter(plot.data, facet == "Intercept") %>% arrange(estimate) %>% with(species)
plot.data <- mutate(plot.data, species = factor(species, levels = sorted))
re.3 <- ggplot(plot.data, aes(x = species, y = estimate, col = group)) +
facet_grid(~ facet) +
geom_hline(yintercept = 1, col = "grey90", size = 1.125) +
# geom_hline(yintercept = 1, lty = 2) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .3) +
geom_point(size = 2.5) +
geom_text(aes(label = p.label), nudge_x = .3, size = 2.5, show.legend = F) +
scale_y_continuous("Odds Ratios", trans = "log", breaks = c(.1, .2, .5, 1, 2, 5,10,20,40)) +
scale_colour_brewer(palette = "Set1") +
coord_flip(ylim = c(.1, 42.5)) + xlab("") +
guides(col = "none") +
ggtitle("B. Species Random Effects")
mat <- matrix(c(rep(1, 3), rep(2, 7)), nrow = 1)
grid.arrange(fe.2, re.3, layout_matrix = mat)
ggsave("../graphs/05_forestplot_fe_re.png", arrangeGrob(fe.2, re.3, layout_matrix = mat), width = 8, height = 3, scale = 1.8)
ggsave("../graphs/Fig3.tiff", arrangeGrob(fe.2, re.3, layout_matrix = mat), width = 8, height = 3, scale = 1.8, type = "cairo", compression = "lzw")
We’re looking for the lowest AIC(c) as the model with the ‘best fit’ with a reasonable number of parameters. (Too many are penalized by AIC as one way to address overfitting.)
Indeed, the reduced model seems to do a better job of striking that balance between fitting the data with fewer parameters.
AICctab(mm.1, mm.2, mm.3, logLik = T, weights = T)
dLogLik dAICc df weight
mm.2 109.9 0.0 28 1
mm.1 112.9 54.9 58 <0.001
mm.3 0.0 193.6 15 <0.001
anova(mm.1, mm.2, mm.3)
Data: model.data
Models:
mm.3: correct ~ delay * norm_age + task_experience + cup_distance +
mm.3: trial + (1 + delay | species)
mm.2: correct ~ delay * norm_age + task_experience + cup_distance +
mm.2: board_size + trial + (1 + delay | site/subject_site) + (1 +
mm.2: delay | species)
mm.1: correct ~ delay * norm_age + task_experience + cup_distance +
mm.1: board_size + trial + (1 + delay + trial | site/subject_site) +
mm.1: (1 + task_experience + cup_distance + board_size + trial +
mm.1: delay | species)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
mm.3 15 7032.3 7133.4 -3501.1 7002.3
mm.2 28 6838.5 7027.2 -3391.3 6782.5 219.7490 13 <2e-16 ***
mm.1 58 6892.6 7283.5 -3388.3 6776.6 5.9521 30 1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Difference
coef1 <- coef(summary(mm.1))[c(2, 3, 6), 1]
coef2 <- coef(summary(mm.2))[c(2, 3, 6), 1]
coef2 - coef1
delaylong delaymedium cup_distance
-0.16840056 -0.16341754 -0.06131301
Difference in odds ratios
exp(coef2) - exp(coef1)
delaylong delaymedium cup_distance
-0.04407273 -0.05561161 -0.09242183